
[RFC] Device Array and compat Device Vector#2660

Open
danhoeflinger wants to merge 77 commits into main from dev/dhoeflin/rfc_device_vector

Conversation

@danhoeflinger
Contributor

@danhoeflinger danhoeflinger commented Apr 8, 2026

This RFC proposes adding an experimental device_array - a RAII container for USM device memory. It gives users a focused interface for managing device allocations and provides the most-travelled usage patterns of device_vector.
The RFC also proposes adding a compat namespace and a device_vector within it. device_vector wraps device_array and provides a similar set of functionality to Thrust's device_vector.

Four documents:

  • README.md - High level decisions, motivation, diagram
  • device_array.md - design and API for device_array
  • device_vector_compat.md - design and API for device_vector and helper classes
  • usage_pattern_study.md - survey of many thrust::device_vector and dpct::device_vector usages, including analysis of alternatives and why they diverged from Thrust

Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
@akukanov akukanov added the RFC label Apr 13, 2026
@danhoeflinger danhoeflinger marked this pull request as ready for review April 13, 2026 17:48
@danhoeflinger danhoeflinger added this to the 2022.13.0 milestone Apr 13, 2026
Comment thread rfcs/proposed/device_vector/README.md Outdated
@danhoeflinger danhoeflinger force-pushed the dev/dhoeflin/rfc_device_vector branch from 4526d5c to 9721af1 Compare April 24, 2026 18:49
@danhoeflinger danhoeflinger changed the title [RFC] Device Vector [RFC] Device Array and compat Device Vector Apr 29, 2026
| Aspect | Proposed (oneDPL) | Thrust | sycl-thrust | SYCLomatic |
|---|---|---|---|---|
| **Default Allocator** | `device_allocator<T>` wrapping `sycl::malloc_device`; custom `DeviceAllocator` concept | `thrust::device_allocator<T>` (CUDA `cudaMalloc`) | `device_allocator<T>` (`sycl::malloc_device`); supports alignment template parameter | USM: `sycl::usm_allocator<T, shared>` / Buffer: `__buffer_allocator<T>` |
| **Memory Model** | **Device memory** via `sycl::malloc_device`; host access triggers explicit transfers | **Device memory** via `cudaMalloc`; host access triggers explicit transfers | **Device memory** via `sycl::malloc_device`; explicit transfers | **Shared memory** via USM shared or SYCL buffer/accessor; runtime manages placement |
Contributor

Is "host access triggers explicit transfers" accurate? I mean, if something is triggered by another action, it is implicit and not explicit, no?

Rarely used in practice (see [usage study](usage_pattern_study.md)),
high implementation complexity for device memory.

- **Host-side operations block but do not synchronize with prior work.**
Contributor

What does "block" mean in this sentence, and is it important in the context of design decisions?

Contributor Author

@danhoeflinger danhoeflinger May 8, 2026

It means they wait() until the operation is complete before returning, but there is no guarantee that previously queued work finishes before this operation starts. This is mentioned because it is a departure from thrust::device_vector. I can make this clearer and provide better justification for why this decision was made (it fits better with SYCL, and the usage study shows users' preferences).

- Individual headers:
`<oneapi/dpl/device_array>` and `<oneapi/dpl/device_vector>`. `device_vector`
would transitively include `device_array` since it depends on it.
- We could have a `compat` header and an individual `device_array` header. However, if we intend to use `device_array` within our own SYCL implementations, that may impact our decision here.
Contributor

This is not quite clear to me. What do you mean by "within our own sycl implementation", why would we want to use device_array there, and how that impacts the decision about headers?

Comment on lines +201 to +204
## `device_span<T>`

`device_array` is not device-copyable (it owns memory). For kernel capture,
non-owning views, and range composition, use `device_span<T>` via `.span()`.
Contributor

I do not think our own span class is needed, because SYCL 2020 says that std::span and sycl::span, if supported by the implementation, must be device copyable. I suggest to remove device_span and the methods that return it.

Contributor Author

Thanks for the pointer to this, I can look into it. I'd be happy to just rely upon sycl::span.

Comment on lines +20 to +21
template <typename T, typename Alloc = device_allocator<T>>
class device_array {
Contributor

I would like to be convinced that the allocator template parameter is needed for this class. It is needed for device_vector, sure, but why should this simplified container allow for an alternative allocator?

Contributor Author

The main argument from my perspective is that if we are building such a system for device_vector to have a "faithful" migration target, we will already have and maintain that capability / complexity. If we have it, why not include it, and directly make device_vector depend upon device_array.

I think there are useful examples which can motivate having this feature, and it is easier to add at the beginning than later on: Memory pooling, alignment requirements, bookkeeping / debugging.

However, I'm open to the idea of having a third internal class which implements allocator-friendly RAII memory management, and having both public classes hold a member of that internal class. device_array could hardcode a simple device_allocator with no frills. The question becomes whether we think there is enough utility in the feature to justify the API complexity. I don't have a strong opinion here really. The argument against is that the SYCL folks explicitly decided not to include an official USM device allocator.

Comment on lines +46 to +47
template <typename InputIt>
device_array(InputIt first, InputIt last, sycl::queue q);
Contributor

Would it make sense to move towards range-based APIs, instead of iterator-based? A separate constructor from std::vector would not be necessary if the generic one accepts any range. For C++17, we may add a named requirement for a range that matches the C++20 concept, as well as a simple to-range converter for a pair of iterators.
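The C++17 idea mentioned above - a named requirement plus a simple to-range converter for iterator pairs - could be sketched as follows. This is only an illustration; the names `iterator_range`, `make_range`, and `count_elements` are hypothetical and not part of the proposal.

```cpp
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical minimal non-owning view pairing two iterators, so that a
// single range-based constructor overload could serve both ranges and
// iterator pairs in C++17.
template <typename It>
struct iterator_range {
    It first_, last_;
    It begin() const { return first_; }
    It end() const { return last_; }
    std::size_t size() const {
        return static_cast<std::size_t>(std::distance(first_, last_));
    }
};

template <typename It>
iterator_range<It> make_range(It first, It last) {
    return {first, last};
}

// A constructor template that only requires begin()/end() could then accept
// std::vector, iterator_range, and any other range uniformly; illustrated
// here with a free function instead of a constructor.
template <typename Range>
std::size_t count_elements(const Range& r) {
    return static_cast<std::size_t>(std::distance(std::begin(r), std::end(r)));
}
```

With such a converter, a dedicated std::vector constructor overload becomes unnecessary, since the generic range overload covers it.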

Comment on lines +66 to +75
// Device-to-device copy (allocates on the provided context+device)
// Supports cross-device copies: source and destination may be on different devices
static device_array copy_from(const device_array& src, sycl::queue q);
static device_array copy_from(const device_array& src,
size_type offset, size_type count, sycl::queue q);
static device_array copy_from(const device_array& src,
sycl::context ctx, sycl::device dev);
static device_array copy_from(const device_array& src,
size_type offset, size_type count,
sycl::context ctx, sycl::device dev);
Contributor

These functions really look like a combination of the "uninitialized" constructor followed by an assignment, just with an additional offset argument. What are the benefits of those - or why copy operations are not appropriate?

Contributor

@akukanov akukanov left a comment

The device_array still seems over-engineered to me. I think it is better to start with a tiny, purposeful API and extend it on demand, rather than trying to include as much as possible from the beginning.

Actually, I would start with the use cases for this container (and code examples showing how it is supposed to be used), and design from those.

Comment on lines +46 to +47
template <typename InputIt>
device_array(InputIt first, InputIt last, sycl::queue q);
Contributor

What is the expected category of iterators/ranges that can be used as the source of data? The name suggests that the requirements are minimal (input_iterator), but then how are the data copied to the device memory underneath the container?

Comment on lines +126 to +131
// Resize — new elements are uninitialized by default
void resize(size_type count);
void resize(size_type count, sycl::queue q);
// Resize — new elements filled with value
void resize(size_type count, const T& value);
void resize(size_type count, const T& value, sycl::queue q);
Contributor

Does resize operate within the container capacity, or do you want it to be truly resizable?

Contributor Author

@danhoeflinger danhoeflinger May 8, 2026

This is thinking with the idea that device_vector needs this functionality, and device_array handles its memory. Similar to the question on allocators, we could have an internal class which handles this functionality for both classes and "turn off" these public features for a no-frills device_array API. I think that is probably the best option.

Comment on lines +333 to +336
- **Should async overloads be in the initial proposal or deferred?**
This provides more control over synchronization than merely an in-order queue,
but it is unclear whether users who are wanting this would just want to work
with USM memory and memcpy directly.
Contributor

In my opinion, everything that is outside of the "minimally viable" scope should be deferred. Async operations seems to be there, but even some other operations might be there as well.

Contributor Author

Sure, we can cut this down to a minimal version.

Comment on lines +89 to +95
// Single-element host access (blocking, creates queue from context & device)
T read(size_type pos) const;
void write(size_type pos, const T& value);

// Single-element host access (blocking, provided queue is used for copy submissions)
T read(size_type pos, sycl::queue q) const;
void write(size_type pos, const T& value, sycl::queue q);
Contributor

I would call these host_{read,write} for clarity.

Comment on lines +203 to +204
`device_array` is not device-copyable (it owns memory). For kernel capture,
non-owning views, and range composition, use `device_span<T>` via `.span()`.
Contributor

I do not quite understand why "it owns memory" means it is not device-copyable. The memory for data is clearly not embedded into the class layout, so the ownership is logical, not physical.

Contributor Author

I can update this language; the technical reason is that a context is owned here. You are correct that the ownership is logical rather than physical. This class is not meant to be used directly on the device, but rather via a span or a raw pointer.

Contributor

@akukanov akukanov May 8, 2026

This class is not meant to be used directly on the device, but rather via a span or a raw pointer.

And this is what bothers me. We set usage restrictions based on the anticipated implementation details, while ideally it should be the opposite.

I really would like this container to be usable as a sized random-access range on the device, so that its use with oneDPL algorithms would be natural and require no special support. For that, it should be device copyable. For that, we should either avoid embedding types that are not device copyable, or just ignore their existence and claim that it is device copyable anyway, designing the rest in such a way that operations allowed on a device never touch the parts that are not supposed to be there.

Contributor Author

Well, without some exterior system for storing contexts (in a global registry, for instance), there are technical limitations within SYCL that make this challenging.

A (non device-copyable) context is the minimum requirement to do transfers to and from the device and to deallocate memory. So either we store the context here, or we must have a system for storing contexts globally and storing pointers, or for searching through a stored list. I really don't like the idea of having such a global registry, but it is something we could do. Perhaps you have some other idea to get around this.

I don't think it's necessarily bad or difficult to have this as the RAII "owning" container, and having lightweight non-owning ranges / iterators like a span or raw USM pointer when we want to pass them around for use in a kernel or oneDPL API. I think this separation is a feature, not a burden, in that it separates the scope of ownership from the usage. This is the same with device_vector: it is not used directly in CUDA kernels, but rather after getting a non-owning device_ptr, or very commonly by extracting the raw pointer and using that directly.

I much prefer thinking of device_array more similar to a std::vector than to a std::shared_ptr with reference counting or something like that.

Contributor

So either we store the context here, or must have a system for storing them globally and storing pointers, or searching through a stored list.

Why not just store it in the host allocated memory though? As far as I understand, the extra overhead would not be on the hot path, and likely rather negligible comparing to the overhead of allocating USM and transferring the actual data.

I much prefer thinking of device_array more similar to a std::vector

I fully agree. Now compare:

std::vector v(/*some args*/);
std::for_each(v.begin(), v.end(), lambda); // works
std::ranges::for_each(v, lambda); // works
tbb::parallel_for(0, v.size(), [&v](int i){ foo(v[i]); }); // works

dpl::device_array da(/*some args*/);
dpl::device_policy policy(q);
dpl::for_each(policy, da.begin(), da.end(), lambda); // works
dpl::ranges::for_each(policy, da, lambda); // does not work, requires span or alike
q.parallel_for(da.size(), [=](sycl::id<1> i){ foo(da[i]); }); // does not work, requires da.data()

Contributor Author

Yes, your code example is compelling, but the TBB / vector combination does not include a host / device relationship where ownership of host-side entities becomes challenging. Also, your final example uses the subscript operator of device_array on the device, which the current design does not allow (it omits this operator). I think offering a subscript operator on this container would be very confusing if it did not provide implicit host-device transfers similar to device_vector. My intention with device_array is that it provides more intentional and direct control over host vs device usage, while keeping most of the convenience of device_vector.

Please correct me if I misunderstand the suggestion, but from a technical standpoint, I think you are suggesting that when constructing a device_array we allocate some "host_state" struct which includes context, and only hold a pointer to that, allowing device_array to stay device-copyable. However, something eventually needs to destroy that host_state. Is it the destructor (after reference counting)?

Let's look at the rules for device copyable:

Type T has at least one eligible copy constructor, move constructor, copy assignment operator, or move assignment operator;
Each eligible copy constructor, move constructor, copy assignment operator, and move assignment operator is public;
The effect of each eligible copy constructor, move constructor, copy assignment operator, and move assignment operator is the same as a bitwise copy of the object;
Type T has a public non-deleted destructor; and
The destructor has no effect.

The copy and move constructors must be equivalent to bitwise copies, and the destructor should have no effect. We could maybe make this the case on the device by turning off aspects of the container with __SYCL_DEVICE_ONLY__, but I'm not sure it fits with device copyable. Even if we do, we are probably reference counting on the host side to see who is responsible for destroying the host_state and the device memory, which I'd prefer to avoid.

This is why I am in favor of having a simple host-side container for lifetime management, and a separate lightweight non-owning layer for actual usage (relying on existing usm pointer and sycl::span).
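The trade-off in this thread can be illustrated with standard C++ type traits: an owning container holding non-trivially-copyable host state fails the bitwise-copy requirement, while a raw-pointer-plus-size view satisfies it. SYCL's device-copyable rules are a distinct (and in some ways looser) requirement than std::is_trivially_copyable, but the trait captures the flavor of the argument. All type names here are illustrative stand-ins, not the proposed API.

```cpp
#include <cstddef>
#include <memory>
#include <type_traits>

// Illustrative stand-in for host-side state (in the real design this
// would hold a sycl::context, which is not device copyable).
struct host_state { int placeholder; };

// Owning container: manages lifetime, so it holds non-trivial members;
// its copy constructor and destructor are not a bitwise copy / no-op.
struct owning_array {
    std::shared_ptr<host_state> state;  // non-trivial copy & destructor
    int* data = nullptr;
    std::size_t n = 0;
};

// Non-owning view: raw pointer + size only; copies are bitwise and the
// destructor has no effect, matching the spirit of device copyability.
struct device_view {
    int* data = nullptr;
    std::size_t n = 0;
};

static_assert(!std::is_trivially_copyable<owning_array>::value,
              "owning container is not bitwise-copyable");
static_assert(std::is_trivially_copyable<device_view>::value,
              "view is bitwise-copyable with a no-op destructor");
```

Under the "shadow copy" semantics proposed above, the owning type could still be declared device copyable by documented restriction; the pointer-plus-size view achieves the same property structurally.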

Contributor

@akukanov akukanov May 8, 2026

I prefer to see an object that is passed to the device as a shadow copy of the object on the host, rather than an independent copy participating in the lifetime management. In other words, no reference counting involved, and instead we say that the host instance of the class is solely responsible for lifetime management and should remain alive for as long as any device code may access the data in any possible way, including via the USM pointers. That is the same requirement you have to have anyway, just extended to the shadow copies of the object as well.

And with that documented semantical restriction, making a shadow copy is equivalent to bitwise copy, and destroying a shadow copy has no effect. In other words, if the lifetime control conditions are met in the user code, the object is de facto device copyable - that is, a SYCL implementation can safely copy the bits back and forth as it wants.
